WIP: Merge Dev to Main #2846
Merged
…size resolution
- introduce `_apply_chunk_size_overlay` to reconcile `chunk_token_size` and `chunk_overlap_token_size` across config tiers
- change `chunk_token_size` and `chunk_overlap_token_size` fields to `Optional[int]` with `None` default
- update `default_chunker_config` to only read strategy-specific env vars, leaving slots empty for overlay fallback
- add precedence chain: addon_params explicit > strategy env > legacy constructor field > legacy env
- back-fill legacy instance fields after resolution for backward compatibility with downstream readers
- update Chinese documentation to reflect new configuration hierarchy and priority rules
- add comprehensive tests covering constructor overlay, addon_params precedence, strategy env wins, and legacy fallback
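The precedence chain in this commit can be read as "first non-empty tier wins". The following is a minimal sketch of that idea, not the repository's `_apply_chunk_size_overlay`; the function name `resolve_chunk_size` and its four parameters are hypothetical stand-ins for the four tiers named above.

```python
from typing import Optional


def resolve_chunk_size(
    addon_explicit: Optional[int],
    strategy_env: Optional[int],
    legacy_field: Optional[int],
    legacy_env: Optional[int],
) -> Optional[int]:
    """Return the first tier that is set, highest precedence first.

    Hypothetical helper: each argument models one tier from the commit
    message, with None meaning "this tier left the slot empty".
    """
    for candidate in (addon_explicit, strategy_env, legacy_field, legacy_env):
        if candidate is not None:
            return candidate
    return None


# The strategy env value beats the legacy constructor field and legacy env.
print(resolve_chunk_size(None, 3000, 1200, 1024))
# An explicit addon_params value beats every other tier.
print(resolve_chunk_size(800, 3000, 1200, 1024))
```

Using `None` as the "unset" marker is what makes the `Optional[int]` field change necessary: a concrete default in the constructor field would be indistinguishable from a value the caller chose on purpose.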
…h semantic strategy
- introduce CHUNK_P_SIZE env variable to decouple P strategy chunk size from global CHUNK_SIZE
- update default_chunker_config to parse and inject CHUNK_P_SIZE into paragraph_semantic options
- modify pipeline to extract and apply per-strategy chunk_token_size for P strategy with fallback to resolved top-level size
- document new env variable and configuration in Chinese docs with usage guidance
- add tests verifying env override behavior and fallback to global chunk size when unset

- add upper version bounds for langchain-text-splitters (<2) and langchain-experimental (<1)
- remove duplicate langchain 1.x and langchain-core 1.x entries from uv.lock
- add missing explicit dependencies (defusedxml, langchain-experimental, langchain-text-splitters) to api/evaluation/offline/test extras
- pin async-timeout to 4.0.3 for python < 3.11 to resolve version conflicts

…e file processing documentation
- reorganize document with numbered sections for server deployment workflow
- add quick start section with legacy, native, and combined configuration examples
- introduce detailed chunk_options configuration with environment variable reference
- add new chapter for python sdk usage covering runtime api and deprecated parameters
- improve clarity on engine fallback, validation, and priority chains
- relocate and expand storage layout, duplicate detection, concurrency, and resume rules sections
- add appendix for upgrade notes regarding deprecated multimodal global switch

…n params
- ensure chunk size configuration is reconciled when runtime addon params are set
- maintain consistency across all four configuration tiers
feat(chunker): add R/V chunkers and chunk_options snapshot mechanism
- move extraction-related settings below multimodal parsing section
- uncomment CHUNK_P_SIZE to set default value of 3000
- improve logical grouping by placing docling settings before extraction configs

…rategies
- introduce CHUNK_R_SIZE env variable for recursive character chunker
- introduce CHUNK_V_SIZE env variable for semantic vector chunker
- update env.example with new per-strategy size options and documentation
- modify pipeline to pop and apply strategy-specific chunk_token_size
- add tests for dedicated env override and fallback behavior for both R and V
- add _format_chunking_log helper to emit concise, scannable log lines
- alias long parameter keys to short forms for readability
- skip None and empty values to keep output compact
- log before each chunking strategy call (P, R, V, F, F(legacy))
- include chunk size, relevant params, and file path in every log line
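As an illustration of the log-line rules listed above (alias long keys, skip None and empty values, always include size and file path), here is a hedged sketch. It is not the repository's `_format_chunking_log`; the function signature, the output layout, and the `KEY_ALIASES` table are assumptions, with the short forms borrowed from the alias commits elsewhere in this PR.

```python
from typing import Any, Mapping

# Assumed alias table; the short names follow the alias commits in this PR.
KEY_ALIASES = {
    "breakpoint_threshold_type": "break",
    "buffer_size": "buf",
    "sentence_split_regex": "regex",
}


def _is_empty(value: Any) -> bool:
    """True for None and for zero-length containers or strings."""
    if value is None:
        return True
    return hasattr(value, "__len__") and len(value) == 0


def format_chunking_log(
    strategy: str, chunk_size: int, params: Mapping[str, Any], file_path: str
) -> str:
    """Build one compact log line for a chunking call (hypothetical layout)."""
    parts = [f"size={chunk_size}"]
    for key, value in params.items():
        if _is_empty(value):
            continue  # keep the line short: unset options are not logged
        parts.append(f"{KEY_ALIASES.get(key, key)}={value}")
    return f"chunking[{strategy}] " + " ".join(parts) + f" file={file_path}"


print(
    format_chunking_log(
        "V",
        1200,
        {
            "breakpoint_threshold_type": "percentile",
            "buffer_size": 1,
            "sentence_split_regex": None,
        },
        "docs/a.md",
    )
)
```

Note that a value of `0` is kept, since only `None` and zero-length values count as empty in this sketch.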
…ptions
- document `CHUNK_R_SIZE` and `CHUNK_V_SIZE` environment variables
- add strategy-specific size fields to recursive_character and semantic_vector examples
- update priority chain to include new R and V size env variables
- clarify R size favors smaller targets for sentence splitting and V size acts as advisory ceiling

… to doc metadata
- rename _format_chunking_log to _format_chunking_params for reuse in both logging and metadata
- add chunk_opts_str to capture and persist actual chunker params to doc_status.metadata
- include chunk_opts in _DOC_STATUS_METADATA_CARRY_OVER_KEYS for visibility across status transitions

- replace three separate metadata fields with single compact string
- keep same information in "pre -> post" format while reducing noise
- signal split occurrence by field presence alone
- P chunker: anchor-less branch falls back to recursive_character splitting so chunk_token_size is honored even when no eligible paragraph anchor is available (e.g. dense academic prose). Previously the block was emitted as a single oversized chunk and relied on the embedding-time hard fallback, which uses embedding_token_limit (not chunk_token_size) and cannot enforce the user-configured size.
- V chunker: extend default sentence_split_regex to recognize CJK sentence terminators (。?!) so SemanticChunker actually produces sentences on Chinese / mixed-language input instead of treating the whole document as one. Add post-split size enforcement via R for any piece exceeding chunk_token_size, since SemanticChunker has no native size cap.
- R chunker: extend default separators with CJK punctuation (。!?;,) so Chinese documents split at semantic boundaries instead of falling through to character-level splitting. English '.?!' intentionally excluded: a literal match would split numerals (0.95) and abbreviations (e.g.).
- Expose CHUNK_V_SENTENCE_SPLIT_REGEX env var (alongside existing CHUNK_R_SEPARATORS) so users can customize per deployment.
- Move shared defaults (DEFAULT_R_SEPARATORS, DEFAULT_SENTENCE_SPLIT_REGEX) to constants.py as the single source of truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
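The sketch below illustrates the two ideas in this commit: splitting after CJK sentence terminators, and capping any piece that is still too long. It is only an approximation under stated assumptions. The regex is a guess at the kind of pattern involved (it uses the full-width terminators 。！？ and is not the value of DEFAULT_SENTENCE_SPLIT_REGEX), character counts stand in for token counts, and the cap uses fixed-width slicing rather than the R chunker that the commit actually applies.

```python
import re
from typing import List

# Assumed pattern: a zero-width split point right after a full-width terminator.
CJK_SENTENCE_SPLIT = re.compile(r"(?<=[。！？])")


def split_sentences(text: str) -> List[str]:
    """Split after CJK terminators, dropping the empty tail piece."""
    return [piece for piece in CJK_SENTENCE_SPLIT.split(text) if piece]


def enforce_size(pieces: List[str], max_len: int) -> List[str]:
    """Hard-cap every piece at max_len characters (stand-in for tokens)."""
    capped: List[str] = []
    for piece in pieces:
        for start in range(0, len(piece), max_len):
            capped.append(piece[start:start + max_len])
    return capped


print(split_sentences("你好。世界！好吗？"))
print(enforce_size(["abcdef", "gh"], 4))

# Why a literal English '.' is a poor separator: it cuts through numerals.
print("accuracy is 0.95 overall".split("."))
```

The last line shows the failure mode the commit avoids by excluding English '.?!' from the default separators: the numeral 0.95 is torn into two fragments.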
…ment
Sync FileProcessingConfiguration-zh.md with the chunker fixes:
- §2.5 process options table: explain new R default cascade (CJK punctuation tier), V's CJK-aware sentence splitter and post-split R-based size enforcement, and P's anchor-less fallback to R.
- §3.2 env vars table: update CHUNK_R_SEPARATORS default, switch CHUNK_V_SIZE description from "advisory ceiling" to "hard cap", and document the new CHUNK_V_SENTENCE_SPLIT_REGEX env var.
- §3.4 chunk_options JSON example: reflect new R separators default and add semantic_vector.sentence_split_regex field.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- add "breakpoint" to "break" alias
- add "buffer" to "buf" alias
- add "sentence_split_regex" to "regex" alias
fix(chunker): CJK punctuation support and chunk_token_size enforcement
- remove redundant alias mappings for "breakpoint" and "buffer"
- shorten "breakpoint_threshold_type" alias from "breakpoint" to "break"
- shorten "buffer_size" alias from "buffer" to "buf"

- remove duplicate blocks_path from p_opts passed to _format_chunking_params
- prevent potential key collision since blocks_path is already extracted separately

…ic chunker
- reduce CHUNK_P_SIZE in env.example from 3000 to 2000 for consistency
- update chunk_token_size default in paragraph_semantic.py from 1200 to 2000

- merge AGENTS.md content into CLAUDE.md and remove duplicate file
- update project structure to reflect current module layout
- add workspace isolation details and pipeline concurrency contract
- include WebUI commands, testing scripts, and setup wizard outputs
- remove redundant sections and streamline common issues

- rename CLAUDE.md to AGENTS.md for generic AI agent usage
- replace full CLAUDE.md content with reference to AGENTS.md
- update .gitignore to use broader "AI Agent files" terminology
… fallback
- detect table format (json / html / unknown) via explicit format= attribute, fall back to body sniffing when attrs are silent
- split json tables on top-level row items and html tables on <tr> boundaries; only when no row boundary is available, or a single row alone exceeds the cap, drop to character-level fallback
- apply the same table-aware fallback in stage C anchor-driven long-block re-split so non-table residuals are character-split while oversized tables retain row integrity
- tests cover detect / html row extraction / json splitting / combined dispatcher; existing _expand_block_with_table_splits paths unchanged
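The detection and row-splitting steps described above could look roughly like the following. This is a sketch under assumptions, not the repository's code: the function names, the attribute dictionary shape, and the sniffing heuristics (a body that parses as a JSON array, or one that contains a `<tr` tag) are all hypothetical.

```python
import json
import re
from typing import Dict, List


def detect_table_format(attrs: Dict[str, str], body: str) -> str:
    """Prefer an explicit format attribute, else sniff the body."""
    explicit = attrs.get("format", "").strip().lower()
    if explicit in ("json", "html"):
        return explicit
    stripped = body.strip()
    if stripped.startswith("["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            pass
    if re.search(r"<tr\b", stripped, re.IGNORECASE):
        return "html"
    return "unknown"


def split_rows(fmt: str, body: str) -> List[str]:
    """Return one string per row; an empty list means no row boundary exists."""
    if fmt == "json":
        data = json.loads(body)
        if isinstance(data, list):
            return [json.dumps(row, ensure_ascii=False) for row in data]
        return []
    if fmt == "html":
        return re.findall(r"<tr\b.*?</tr>", body, re.IGNORECASE | re.DOTALL)
    return []


html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(detect_table_format({}, html), split_rows("html", html))
print(detect_table_format({}, "[[1, 2], [3, 4]]"), split_rows("json", "[[1, 2], [3, 4]]"))
```

In this sketch an empty result from `split_rows` is the signal for the caller to drop to character-level splitting, mirroring the "only when no row boundary is available" rule in the commit.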
- account for table wrapper overhead in row splitter budgets to prevent post-wrap overflows
- add recursive re-splitting for table chunks that still exceed target_max after wrapping
- debit newline separator tokens in no-anchor greedy packing to enforce target_max strictly
- add tests for separator token accounting and wrapper overhead budgeting
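To show why separator tokens matter for a strict cap, here is a small greedy packer that debits the joining newline from the budget. It is an illustrative sketch only: `count_tokens` is a whitespace word count standing in for the real tokenizer, and the one-token cost per newline (`SEPARATOR_TOKENS`) is an assumption.

```python
from typing import List

SEPARATOR_TOKENS = 1  # assumed cost of the "\n" that joins two paragraphs


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def pack_greedy(paragraphs: List[str], target_max: int) -> List[str]:
    """Pack paragraphs into chunks, charging each join against target_max."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in paragraphs:
        cost = count_tokens(para)
        # A paragraph appended to a non-empty chunk also pays for the separator.
        extra = cost + (SEPARATOR_TOKENS if current else 0)
        if current and used + extra > target_max:
            chunks.append("\n".join(current))
            current, used = [para], cost
        else:
            current.append(para)
            used += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


paragraphs = ["a b c", "d e", "f g h"]
print(pack_greedy(paragraphs, 6))
print(pack_greedy(paragraphs, 5))
```

With a budget of 5, the first two paragraphs total 5 words but 6 tokens once the separator is counted, so they are kept apart. A packer that ignored the separator would merge them and overflow by one token after joining. A single paragraph larger than the budget is emitted alone here; handling that case is a separate step.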
- fix missing newline at end of file to follow POSIX standard
- remove single-paragraph early return and recursive guard to allow character-level splitting of oversized single paragraphs
- re-measure joined content after separator tokens in tail absorption to prevent silent overflow
- disable chunk overlap in recursive character fallback to honor non-overlapping contract
- add regression tests for merge boundary checks, single-paragraph split, and fallback overlap behavior
…TNxo8 feat(opensearch): add basename and content_hash lookups for doc status
…tion
- eliminate unnecessary `:-/` fallback in redis uri path capture
- ensure exact path preservation from original uri during local service normalization

… setup scripts
- add /app/data/prompts directory creation in dockerfile and dockerfile.lite
- add PROMPT_DIR environment variable and volume mounts in all compose files
- update setup scripts to support PROMPT_DIR configuration and idempotent mount injection
- fix redis test default uri to remove trailing slash

- consolidate verbose log strings in parse_mineru and parse_docling to reduce noise
- shorten analyze_multimodal opt-in missing and backfill log lines for clarity
- remove redundant file_path references from completion and cache hit logs
- update chinese documentation to match simplified log format
… methods
- remove default implementations of get_doc_by_file_basename and get_doc_by_content_hash
- add `@abstractmethod` decorator to enforce implementation in subclasses
- clean up unused asdict import from dataclasses module
- simplify docstrings to reflect abstract nature of methods
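The effect of moving from default implementations to abstract methods can be seen in a small standalone example. The class names and method signatures below are hypothetical (the real base class, its argument types, and whether the methods are async are not shown in this PR text); only the two method names come from the commit.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class DocStatusStorageSketch(ABC):
    """Hypothetical base class; signatures are assumptions."""

    @abstractmethod
    def get_doc_by_file_basename(self, basename: str) -> Optional[Dict[str, Any]]:
        """Look up a document status record by file basename."""

    @abstractmethod
    def get_doc_by_content_hash(self, content_hash: str) -> Optional[Dict[str, Any]]:
        """Look up a document status record by content hash."""


class IncompleteStorage(DocStatusStorageSketch):
    # Only one of the two abstract methods is implemented.
    def get_doc_by_file_basename(self, basename: str) -> Optional[Dict[str, Any]]:
        return None


try:
    IncompleteStorage()
except TypeError as exc:
    # Instantiation fails because get_doc_by_content_hash is still abstract.
    print(f"rejected: {exc}")
```

The design trade-off is that a backend missing a lookup now fails at construction time instead of silently inheriting a default that might return wrong results.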
…ation
- correct the info log message format for empty equations sidecar in analyze_multimodal

- replace specific entity_type subdirectory with entire prompts directory
- update comment to reflect user customized prompt directory purpose

- disable default memgraph port exposure for improved security in template
- allow users to opt-in to port exposure via environment configuration if needed

- replace file_path with doc_id in chunking log messages for better traceability
- apply consistent logging format across all chunking strategies (P, R, V, F, legacy)

- change ignored path from entire prompts directory to specific entity_type subdirectory
- add documentation for user-defined prompts folder purpose

- clarify default behavior when ENTITY_EXTRACTION_USE_JSON is unset
- improve description of json output trade-offs with latency and reliability
…load feedback
Backend:
- /health derives pipeline_active = busy || scanning || destructive_busy || pending_enqueues > 0
- Also exposes pipeline_scanning / pipeline_destructive_busy / pipeline_pending_enqueues
- Closes the gap where the scan classification phase set only `scanning` and the pipeline-busy button stayed grey for 5~10s
Frontend:
- Add activity probe: exponential-backoff /health bursts at t=0/1/2/4/8/16s fired by scan_started and the first successful upload in a batch. Exits as soon as both pipelineActive=true AND the document list has caught up.
- Add refreshDocumentsThrottled(): wall-clock 2s minimum between any two /documents/paginated requests, with trailing-call coalescing.
- Scan/upload no longer rely on resetHealthCheckTimerDelayed + ad hoc fast polling windows; probe + active polling cover both paths.
- Polling stays at 5s while pipelineActive=true even if doc list hasn't surfaced new rows yet, so the 30s idle gap right after scan disappears.
- Stale trailing refresh is dropped via latestRefreshRequestVersionRef check so 2s-window page/filter/sort changes can't be overwritten by a captured old query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
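The backend derivation and the probe schedule from this commit can be expressed compactly. The sketch below is illustrative only: the `PipelineFlags` dataclass and both function names are hypothetical, and the real frontend probe is written for the WebUI rather than in Python. Only the boolean formula and the 0/1/2/4/8/16s offsets come from the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineFlags:
    """Hypothetical container for the four signals named in the commit."""

    busy: bool = False
    scanning: bool = False
    destructive_busy: bool = False
    pending_enqueues: int = 0


def pipeline_active(flags: PipelineFlags) -> bool:
    """busy || scanning || destructive_busy || pending_enqueues > 0"""
    return (
        flags.busy
        or flags.scanning
        or flags.destructive_busy
        or flags.pending_enqueues > 0
    )


def probe_offsets(max_offset: float = 16.0) -> List[float]:
    """Exponential-backoff probe times in seconds: 0, 1, 2, 4, ... max_offset."""
    offsets = [0.0]
    delay = 1.0
    while delay <= max_offset:
        offsets.append(delay)
        delay *= 2
    return offsets


# A scan in its classification phase sets only `scanning`.
print(pipeline_active(PipelineFlags(scanning=True)))
print(probe_offsets())
```

The first call shows the gap the commit closes: a state where only `scanning` is set now counts as active.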
feat(pipeline-status): probe + throttled refresh for prompt scan/upload feedback
Add English versions of the three Chinese-only docs:
- FileProcessingPipeline.md
- LightRAGSidecarFormat.md
- ParserDebugCLI.md

https://claude.ai/code/session_01PEf2XkGrpo79D43GVPWn3G

…sh-NXghN' into dev

- remove legacy upgrade appendix about deprecated global multimodal switch
- keep both chinese and english documentation in sync

- add RagAnything merge announcement with MinerU/Docling support
- document four new text chunking strategies
- add role-specific LLM configuration details
P (paragraph_semantic) chunking now uses DEFAULT_CHUNK_P_SIZE (2000) when CHUNK_P_SIZE env is unset, instead of silently inheriting the global CHUNK_SIZE / LightRAG(chunk_token_size=...). Paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together; inheriting the smaller global ceiling defeats the strategy's purpose.

Precedence (high → low):
caller-supplied paragraph_semantic.chunk_token_size > CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE (2000)

The backfill lives in slim_chunk_options(), the single chokepoint shared by both enqueue paths (resolve_chunk_options + caller-supplied chunk_options=). _apply_chunk_size_overlay() carries a mirror backfill so direct addon_params introspection sees the resolved value too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
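A minimal sketch of the three-step precedence for the P strategy follows. It is not the code in slim_chunk_options(); the function name `resolve_p_chunk_size` and the choice to pass the environment in as a mapping are assumptions made so the example is easy to test. The constant value 2000 and the env variable name come from the commit message.

```python
from typing import Mapping, Optional

DEFAULT_CHUNK_P_SIZE = 2000  # value stated in the commit message


def resolve_p_chunk_size(caller_value: Optional[int], env: Mapping[str, str]) -> int:
    """caller-supplied value > CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE.

    Hypothetical helper. The global CHUNK_SIZE is deliberately not consulted.
    """
    if caller_value is not None:
        return caller_value
    raw = env.get("CHUNK_P_SIZE", "").strip()
    if raw:
        return int(raw)
    return DEFAULT_CHUNK_P_SIZE


# The global CHUNK_SIZE is present but ignored for the P strategy.
print(resolve_p_chunk_size(None, {"CHUNK_SIZE": "1200"}))
print(resolve_p_chunk_size(None, {"CHUNK_P_SIZE": "3000"}))
print(resolve_p_chunk_size(1500, {"CHUNK_P_SIZE": "3000"}))
```

In real use the mapping would typically be `os.environ`; passing it as an argument keeps the sketch free of process-wide state.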
…-default feat(chunker): give P strategy a dedicated default chunk_token_size
…d modalities
- remove idempotent skip logic for existing llm_analyze_result entries
- overwrite prior success/skipped/failure results on each run for enabled modalities
- allow retry after fixing vlm/extract configuration without manual sidecar cleanup
- rely on llm analysis cache to avoid redundant provider calls when inputs unchanged
- update docs and tests to reflect new non-idempotent overwrite behavior
… response logging
- add `raise_for_status_with_detail` and `response_error_detail` helpers to `_common.py`
- replace ad-hoc status checks in docling and mineru clients with unified helper
- include compact response body snippets in error messages for faster debugging
- add test coverage for HTTP error preservation and non-2xx handling in docling and mineru
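One plausible shape for the two helpers is sketched below. The helper names come from the commit, but everything else is assumed: the signatures, the 200-character snippet limit, the `RuntimeError` type, and the `FakeResponse` stand-in (the actual HTTP client used by the docling and mineru clients is not stated here).

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response object; only two fields are modelled."""

    status_code: int
    text: str


def response_error_detail(response: FakeResponse, limit: int = 200) -> str:
    """Collapse whitespace and truncate the body to a compact snippet."""
    body = " ".join(response.text.split())
    return body if len(body) <= limit else body[:limit] + "..."


def raise_for_status_with_detail(response: FakeResponse, context: str) -> None:
    """Raise on non-2xx, including a body snippet in the message."""
    if 200 <= response.status_code < 300:
        return
    detail = response_error_detail(response)
    raise RuntimeError(f"{context}: HTTP {response.status_code}: {detail}")


raise_for_status_with_detail(FakeResponse(200, "ok"), "parse")  # no error
try:
    raise_for_status_with_detail(
        FakeResponse(500, '{"error":\n  "model not loaded"}'), "parse"
    )
except RuntimeError as exc:
    print(exc)
```

Keeping the snippet short and single-line is what makes it usable inside a log line while still surfacing the server's own explanation.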
…format
- add lightrag_load_errors collection to track blocks.jsonl read failures
- skip documents with unreadable blocks instead of creating false "{{LRdoc}}" entries
- flush failed stubs via apipeline_enqueue_error_documents inside critical section
- return track_id on failure-only batches instead of None to prevent silent archival
- expose file_size and original_error in failure records for better debugging
- add reference to FileProcessingPipeline.md documentation for parser setup
- change example LIGHTRAG_PARSER from commented to active with new default pattern
- update parser pattern to use native-teP and legacy-R fallbacks

…d paragraph semantic chunking documentation
- update both zh and en quick start sections with clearer legacy, recommended and multimodal scenarios
- replace mineru-centric examples with native-teP and legacy-R combinations
- add new comprehensive ParagraphSemanticChunking.md with full P strategy documentation
- remove outdated native-only docx examples and docling references
- align zh docs with en structure and terminology
Dummy PR: Merge Dev to Main (Never try to merge this PR)